Introduction

We are designing a dynamic analysis system for a database containing clinical, genomic, and proteomic data from 200 patients with Myotonic Dystrophy Type 1 (DM1), collected across 25 hospitals and entered by 13 different personnel.

Myotonic Dystrophy Type 1 is the most common adult-onset muscular dystrophy. It is a multi-systemic disorder: beyond progressive muscle weakness, patients experience cardiac conduction defects, endocrine abnormalities, cataracts, and cognitive impairment. A well-designed analytical system must therefore support exploration across multiple clinical domains and detect subtle longitudinal changes that unfold over months to years.

Data source & Environment setup

We use the publicly available random.cdisc.data package from the pharmaverse ecosystem, which generates synthetic CDISC ADaM datasets that mirror the structure of a real multi-centre clinical trial.

All dependencies are declared in the project’s Nix flake, so running nix develop makes every package below available without manual installation.

## All packages are provided by the Nix flake — no manual installation needed.
## If you run this report outside `nix develop`, uncomment the block below:
#
# install.packages(
#   c("tidyverse", "survival", "random.cdisc.data",
#     "gridExtra", "broom", "scales",
#     "plotly", "htmltools", "webshot2"),
#   repos = "https://cloud.r-project.org"
# )

library(tidyverse)
library(survival)
library(random.cdisc.data)
library(gridExtra)
library(broom)
library(scales)

# Detect output format: interactive plotly for HTML, static ggplot for markdown
is_html <- knitr::is_html_output()
if (is_html) {
  library(plotly)
  library(htmltools)
}

set.seed(42)

# To render this document
# Rscript -e 'rmarkdown::render("small_exploratory_analysis/exploratory_analysis.Rmd")'
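
The interactive chunks later in this report pipe their figures through add_plotly_config(), which is not defined in the code shown here. Below is a minimal sketch of what such a helper might look like, assuming it is only a thin wrapper around plotly::config(). The options chosen are illustrative assumptions, not the report's actual settings.

```r
library(plotly)

# Hypothetical helper: the report calls add_plotly_config() but does not show
# its definition. This version only hides the logo and trims the mode bar.
add_plotly_config <- function(p) {
  plotly::config(
    p,
    displaylogo = FALSE,                               # hide the plotly logo
    modeBarButtonsToRemove = c("lasso2d", "select2d")  # drop selection tools
  )
}

# Usage: apply the shared configuration to any plotly object
fig <- add_plotly_config(
  plot_ly(x = 1:3, y = c(2, 4, 8), type = "scatter", mode = "lines")
)
```

Keeping the configuration in one function means every figure in the report gets the same toolbar, and a change to the settings is made in one place.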

1. Data loading and preparation: Discovering available datasets

The full catalogue of cached datasets can be enumerated programmatically. This is useful when onboarding new team members or scripting automated pipelines that must know what data is available:

# List every cached dataset shipped with random.cdisc.data
available <- data(package = "random.cdisc.data")$results
knitr::kable(data.frame(
  Dataset     = available[, "Item"],
  Description = available[, "Title"],
  stringsAsFactors = FALSE
))
Dataset Description
cadab Cached ADAB
cadae Cached ADAE
cadaette Cached ADAETTE
cadcm Cached ADCM
caddv Cached ADDV
cadeg Cached ADEG
cadex Cached ADEX
cadhy Cached ADHY
cadlb Cached ADLB
cadmh Cached ADMH
cadpc Cached ADPC
cadpp Cached ADPP
cadqlqc Cached ADQLQC
cadqs Cached ADQS
cadrs Cached ADRS
cadsl Cached ADSL
cadsub Cached ADSUB
cadtr Cached ADTR
cadtte Cached ADTTE
cadvs Cached ADVS

The listing shows that the package ships with 20 cached datasets covering domains such as adverse events (ADAE), ECG (ADEG), exposure (ADEX), medical history (ADMH), pharmacokinetics (ADPC/ADPP), questionnaires (ADQS), tumour response (ADRS/ADTR), and more.

For this exploratory pass we select four that map naturally to the DM1 clinical context, following the consensus-based care recommendations for adults with DM1 (Ashizawa et al., 2018), which emphasise regular cardiac monitoring, hepatic and endocrine surveillance, and longitudinal safety tracking:

| Dataset | CDISC domain | Why selected for DM1 | Key variables |
|---|---|---|---|
| cadsl | ADSL | Demographics & baseline: needed for any analysis and to check arm balance | AGE, SEX, RACE, ARM, BMRKR1/2 |
| cadvs | ADVS | Vital signs over time: cardiovascular monitoring is critical in DM1 (cardiac conduction defects) | SYSBP, DIABP, PULSE per visit |
| cadlb | ADLB | Lab biomarkers: hepatic (ALT) and inflammatory (CRP) surveillance relevant to DM1 therapies | ALT, CRP, IGA per visit |
| cadaette | ADAETTE | Time-to-adverse-event: the most information-rich safety comparison across arms | Time, event/censor, by arm |

data("cadsl");    adsl    <- cadsl
data("cadvs");    advs    <- cadvs
data("cadlb");    adlb    <- cadlb
data("cadaette"); adaette <- cadaette

tibble(
  Dataset  = c("ADSL", "ADVS", "ADLB", "ADAETTE"),
  Rows     = c(nrow(adsl), nrow(advs), nrow(adlb), nrow(adaette)),
  Columns  = c(ncol(adsl), ncol(advs), ncol(adlb), ncol(adaette))
) %>% knitr::kable(align = "lrr")
Dataset Rows Columns
ADSL 400 55
ADVS 16800 87
ADLB 8400 102
ADAETTE 3600 66

Datasets not used here (e.g. ADAE for individual AE listings, ADEG for ECG intervals, ADQS for patient-reported outcomes) would become relevant in a full production analysis; the proposed system is designed to accommodate all of them.


Variable inventory

Before any modelling we need to understand what each dataset contains and which columns are candidates for specific analytical tasks. The table below profiles every variable: its R class, the percentage of non-missing values, the number of distinct values, and a sample entry. This is the kind of metadata catalogue the proposed system would expose through an automated data-dictionary endpoint.

profile_vars <- function(df, label) {
  tibble(
    Dataset  = label,
    Variable = names(df),
    Class    = sapply(df, function(x) paste(class(x), collapse = "/")),
    `Non-NA %` = sapply(df, function(x) sprintf("%.0f%%", 100 * mean(!is.na(x)))),
    `Distinct` = sapply(df, n_distinct),
    Example  = sapply(df, function(x) {
      v <- na.omit(x)
      if (length(v) == 0) return("NA")
      as.character(v[1])
    })
  )
}

bind_rows(
  profile_vars(adsl,    "ADSL"),
  profile_vars(advs,    "ADVS"),
  profile_vars(adlb,    "ADLB"),
  profile_vars(adaette, "ADAETTE")
) %>%
  knitr::kable()
Dataset Variable Class Non-NA % Distinct Example
ADSL STUDYID character 100% 1 AB12345
ADSL USUBJID character 100% 400 AB12345-CHN-3-id-128
ADSL SUBJID character 100% 400 id-128
ADSL SITEID character 100% 95 CHN-3
ADSL AGE integer 100% 38 32
ADSL AGEU factor 100% 1 YEARS
ADSL SEX factor 100% 2 M
ADSL RACE factor 100% 6 ASIAN
ADSL ETHNIC factor 100% 4 HISPANIC OR LATINO
ADSL COUNTRY factor 100% 9 CHN
ADSL DTHFL factor 100% 2 Y
ADSL INVID character 100% 95 INV ID CHN-3
ADSL INVNAM character 100% 95 Dr. CHN-3 Doe
ADSL ARM factor 100% 3 A: Drug X
ADSL ARMCD factor 100% 3 ARM A
ADSL ACTARM factor 100% 3 A: Drug X
ADSL ACTARMCD factor 100% 3 ARM A
ADSL TRT01P factor 100% 3 A: Drug X
ADSL TRT01A factor 100% 3 A: Drug X
ADSL TRT02P factor 100% 3 B: Placebo
ADSL TRT02A factor 100% 3 A: Drug X
ADSL REGION1 factor 100% 6 Asia
ADSL STRATA1 factor 100% 3 C
ADSL STRATA2 factor 100% 2 S2
ADSL BMRKR1 numeric 100% 400 14.424933692778
ADSL BMRKR2 factor 100% 3 MEDIUM
ADSL ITTFL factor 100% 1 Y
ADSL SAFFL factor 100% 1 Y
ADSL BMEASIFL factor 100% 2 Y
ADSL BEP01FL factor 100% 2 Y
ADSL AEWITHFL factor 100% 2 N
ADSL RANDDT Date 100% 296 2019-02-22
ADSL TRTSDTM POSIXct/POSIXt 100% 400 2019-02-24 11:09:25.683
ADSL TRTEDTM POSIXct/POSIXt 82% 328 2022-02-12 04:28:08.683
ADSL TRT01SDTM POSIXct/POSIXt 100% 400 2019-02-24 11:09:25.683
ADSL TRT01EDTM POSIXct/POSIXt 82% 328 2021-02-11 22:28:08.683
ADSL TRT02SDTM POSIXct/POSIXt 82% 328 2021-02-11 22:28:08.683
ADSL TRT02EDTM POSIXct/POSIXt 82% 328 2022-02-12 04:28:08.683
ADSL AP01SDTM POSIXct/POSIXt 100% 400 2019-02-24 11:09:25.683
ADSL AP01EDTM POSIXct/POSIXt 82% 328 2021-02-11 22:28:08.683
ADSL AP02SDTM POSIXct/POSIXt 82% 328 2021-02-11 22:28:08.683
ADSL AP02EDTM POSIXct/POSIXt 82% 328 2022-02-12 04:28:08.683
ADSL EOSSTT factor 100% 3 DISCONTINUED
ADSL EOTSTT factor 100% 3 DISCONTINUED
ADSL EOSDT Date 82% 179 2022-02-12
ADSL EOSDY integer 82% 114 1084
ADSL DCSREAS factor 30% 8 DEATH
ADSL DTHDT Date 18% 44 2022-03-06
ADSL DTHCAUS factor 18% 8 ADVERSE EVENT
ADSL DTHCAT factor 18% 4 ADVERSE EVENT
ADSL LDDTHELD integer 18% 36 22
ADSL LDDTHGR1 factor 18% 3 <=30
ADSL LSTALVDT Date 82% 219 2022-03-06
ADSL DTHADY integer 18% 67 1105
ADSL ADTHAUT factor 14% 3 Yes
ADVS STUDYID character 100% 1 AB12345
ADVS USUBJID character 100% 400 AB12345-BRA-1-id-105
ADVS SUBJID character 100% 400 id-105
ADVS SITEID character 100% 95 BRA-1
ADVS AGE integer 100% 38 38
ADVS AGEU factor 100% 1 YEARS
ADVS SEX factor 100% 2 M
ADVS RACE factor 100% 6 BLACK OR AFRICAN AMERICAN
ADVS ETHNIC factor 100% 4 HISPANIC OR LATINO
ADVS COUNTRY factor 100% 9 BRA
ADVS DTHFL factor 100% 2 N
ADVS INVID character 100% 95 INV ID BRA-1
ADVS INVNAM character 100% 95 Dr. BRA-1 Doe
ADVS ARM factor 100% 3 A: Drug X
ADVS ARMCD factor 100% 3 ARM A
ADVS ACTARM factor 100% 3 A: Drug X
ADVS ACTARMCD factor 100% 3 ARM A
ADVS TRT01P factor 100% 3 A: Drug X
ADVS TRT01A factor 100% 3 A: Drug X
ADVS TRT02P factor 100% 3 C: Combination
ADVS TRT02A factor 100% 3 A: Drug X
ADVS REGION1 factor 100% 6 South America
ADVS STRATA1 factor 100% 3 B
ADVS STRATA2 factor 100% 2 S1
ADVS BMRKR1 numeric 100% 400 4.15691403407286
ADVS BMRKR2 factor 100% 3 MEDIUM
ADVS ITTFL factor 100% 1 Y
ADVS SAFFL factor 100% 1 Y
ADVS BMEASIFL factor 100% 2 Y
ADVS BEP01FL factor 100% 2 Y
ADVS AEWITHFL factor 100% 2 N
ADVS RANDDT Date 100% 296 2020-03-08
ADVS TRTSDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADVS TRTEDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADVS TRT01SDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADVS TRT01EDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADVS TRT02SDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADVS TRT02EDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADVS AP01SDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADVS AP01EDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADVS AP02SDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADVS AP02EDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADVS EOSSTT factor 100% 3 DISCONTINUED
ADVS EOTSTT factor 100% 3 DISCONTINUED
ADVS EOSDT Date 82% 179 2022-02-14
ADVS EOSDY integer 82% 114 709
ADVS DCSREAS factor 30% 8 PROTOCOL VIOLATION
ADVS DTHDT Date 18% 44 2022-03-16
ADVS DTHCAUS factor 18% 8 ADVERSE EVENT
ADVS DTHCAT factor 18% 4 ADVERSE EVENT
ADVS LDDTHELD integer 18% 36 24
ADVS LDDTHGR1 factor 18% 3 <=30
ADVS LSTALVDT Date 82% 219 2022-03-09
ADVS DTHADY integer 18% 67 496
ADVS ADTHAUT factor 14% 3 Yes
ADVS ASEQ integer 100% 42 1
ADVS VSSEQ integer 100% 42 1
ADVS VSTESTCD factor 100% 6 DIABP
ADVS VSTEST factor 100% 6 Diastolic Blood Pressure
ADVS VSCAT factor 100% 1 VITAL SIGNS
ADVS VSSTRESC character 100% 6 <80
ADVS ASPID integer 100% 16800 14596
ADVS PARAM factor 100% 6 Diastolic Blood Pressure
ADVS PARAMCD factor 100% 6 DIABP
ADVS AVAL numeric 100% 16800 72.5958424490508
ADVS AVALU factor 100% 5 Pa
ADVS BASE2 numeric 100% 2400 72.5958424490508
ADVS BASE numeric 86% 2401 113.744509468754
ADVS BASETYPE factor 100% 1 LAST
ADVS ABLFL2 factor 100% 2 Y
ADVS ABLFL factor 100% 2
ADVS CHG2 numeric 100% 14401 0
ADVS PCHG2 numeric 100% 14401 0
ADVS CHG numeric 86% 12002 0
ADVS PCHG numeric 86% 12002 0
ADVS DTYPE factor 0% 1 NA
ADVS ANRIND factor 100% 3 LOW
ADVS BNRIND factor 100% 3 NORMAL
ADVS ADTM POSIXct/POSIXt 100% 2800 2020-03-26 05:39:28.683
ADVS ADY integer 100% 975 18
ADVS ATPTN integer 100% 1 1
ADVS AVISIT factor 100% 7 SCREENING
ADVS AVISITN integer 100% 7 -1
ADVS LOQFL factor 100% 2 Y
ADVS ONTRTFL factor 100% 2
ADVS ANRLO numeric 100% 6 80
ADVS ANRHI numeric 100% 5 120
ADLB STUDYID character 100% 1 AB12345
ADLB USUBJID character 100% 400 AB12345-BRA-1-id-105
ADLB SUBJID character 100% 400 id-105
ADLB SITEID character 100% 95 BRA-1
ADLB AGE integer 100% 38 38
ADLB AGEU factor 100% 1 YEARS
ADLB SEX factor 100% 2 M
ADLB RACE factor 100% 6 BLACK OR AFRICAN AMERICAN
ADLB ETHNIC factor 100% 4 HISPANIC OR LATINO
ADLB COUNTRY factor 100% 9 BRA
ADLB DTHFL factor 100% 2 N
ADLB INVID character 100% 95 INV ID BRA-1
ADLB INVNAM character 100% 95 Dr. BRA-1 Doe
ADLB ARM factor 100% 3 A: Drug X
ADLB ARMCD factor 100% 3 ARM A
ADLB ACTARM factor 100% 3 A: Drug X
ADLB ACTARMCD factor 100% 3 ARM A
ADLB TRT01P factor 100% 3 A: Drug X
ADLB TRT01A factor 100% 3 A: Drug X
ADLB TRT02P factor 100% 3 C: Combination
ADLB TRT02A factor 100% 3 A: Drug X
ADLB REGION1 factor 100% 6 South America
ADLB STRATA1 factor 100% 3 B
ADLB STRATA2 factor 100% 2 S1
ADLB BMRKR1 numeric 100% 400 4.15691403407286
ADLB BMRKR2 factor 100% 3 MEDIUM
ADLB ITTFL factor 100% 1 Y
ADLB SAFFL factor 100% 1 Y
ADLB BMEASIFL factor 100% 2 Y
ADLB BEP01FL factor 100% 2 Y
ADLB AEWITHFL factor 100% 2 N
ADLB RANDDT Date 100% 296 2020-03-08
ADLB TRTSDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADLB TRTEDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADLB TRT01SDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADLB TRT01EDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADLB TRT02SDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADLB TRT02EDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADLB AP01SDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADLB AP01EDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADLB AP02SDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADLB AP02EDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADLB EOSSTT factor 100% 3 DISCONTINUED
ADLB EOTSTT factor 100% 3 DISCONTINUED
ADLB EOSDT Date 82% 179 2022-02-14
ADLB EOSDY integer 82% 114 709
ADLB DCSREAS factor 30% 8 PROTOCOL VIOLATION
ADLB DTHDT Date 18% 44 2022-03-16
ADLB DTHCAUS factor 18% 8 ADVERSE EVENT
ADLB DTHCAT factor 18% 4 ADVERSE EVENT
ADLB LDDTHELD integer 18% 36 24
ADLB LDDTHGR1 factor 18% 3 <=30
ADLB LSTALVDT Date 82% 219 2022-03-09
ADLB DTHADY integer 18% 67 496
ADLB ADTHAUT factor 14% 3 Yes
ADLB ASEQ integer 100% 21 1
ADLB LBSEQ integer 100% 21 1
ADLB LBTESTCD factor 100% 3 ALT
ADLB LBTEST factor 100% 3 Alanine Aminotransferase Measurement
ADLB LBCAT factor 100% 2 CHEMISTRY
ADLB LBSTRESC character 100% 3 <7
ADLB ASPID integer 100% 8400 6364
ADLB PARAM factor 100% 3 Alanine Aminotransferase Measurement
ADLB PARAMCD factor 100% 3 ALT
ADLB AVAL numeric 100% 8400 4.2979212245254
ADLB AVALU factor 100% 3 U/L
ADLB BASE2 numeric 100% 1200 4.2979212245254
ADLB BASE numeric 86% 1201 24.695881839145
ADLB BASETYPE factor 100% 1 LAST
ADLB ABLFL2 factor 100% 2 Y
ADLB ABLFL factor 100% 2
ADLB CHG2 numeric 100% 7201 0
ADLB PCHG2 numeric 100% 7201 0
ADLB CHG numeric 86% 6002 0
ADLB PCHG numeric 86% 6002 0
ADLB DTYPE logical 0% 1 NA
ADLB ANRIND factor 100% 3 LOW
ADLB BNRIND factor 100% 3 NORMAL
ADLB SHIFT1 factor 100% 10
ADLB ATOXGR factor 100% 9 -4
ADLB BTOXGR factor 100% 9 0
ADLB ADTM POSIXct/POSIXt 100% 2800 2020-05-27 05:39:28.683
ADLB ADY integer 100% 976 80
ADLB ATPTN integer 100% 1 1
ADLB AVISIT factor 100% 7 SCREENING
ADLB AVISITN integer 100% 7 -1
ADLB LOQFL factor 100% 2 Y
ADLB ONTRTFL factor 100% 2
ADLB WORS01FL factor 100% 2
ADLB WGRHIFL factor 100% 2
ADLB WGRLOFL factor 100% 2
ADLB WGRHIVFL factor 100% 2
ADLB WGRLOVFL factor 100% 2
ADLB ANL01FL factor 100% 2
ADLB ANRLO numeric 100% 3 7
ADLB ANRHI numeric 100% 3 55
ADLB BTOXGRL factor 100% 6 0
ADLB BTOXGRH factor 100% 6 0
ADLB ATOXGRL factor 100% 6 4
ADLB ATOXGRH factor 100% 6
ADLB ATOXDSCL character 0% 1 NA
ADLB ATOXDSCH character 100% 3 Alanine aminotransferase increased
ADAETTE STUDYID character 100% 1 AB12345
ADAETTE USUBJID character 100% 400 AB12345-BRA-1-id-105
ADAETTE SUBJID character 100% 400 id-105
ADAETTE SITEID character 100% 95 BRA-1
ADAETTE AGE integer 100% 38 38
ADAETTE AGEU factor 100% 1 YEARS
ADAETTE SEX factor 100% 2 M
ADAETTE RACE factor 100% 6 BLACK OR AFRICAN AMERICAN
ADAETTE ETHNIC factor 100% 4 HISPANIC OR LATINO
ADAETTE COUNTRY factor 100% 9 BRA
ADAETTE DTHFL factor 100% 2 N
ADAETTE INVID character 100% 95 INV ID BRA-1
ADAETTE INVNAM character 100% 95 Dr. BRA-1 Doe
ADAETTE ARM factor 100% 3 A: Drug X
ADAETTE ARMCD factor 100% 3 ARM A
ADAETTE ACTARM factor 100% 3 A: Drug X
ADAETTE ACTARMCD factor 100% 3 ARM A
ADAETTE TRT01P factor 100% 3 A: Drug X
ADAETTE TRT01A factor 100% 3 A: Drug X
ADAETTE TRT02P factor 100% 3 C: Combination
ADAETTE TRT02A factor 100% 3 A: Drug X
ADAETTE REGION1 factor 100% 6 South America
ADAETTE STRATA1 factor 100% 3 B
ADAETTE STRATA2 factor 100% 2 S1
ADAETTE BMRKR1 numeric 100% 400 4.15691403407286
ADAETTE BMRKR2 factor 100% 3 MEDIUM
ADAETTE ITTFL factor 100% 1 Y
ADAETTE SAFFL factor 100% 1 Y
ADAETTE BMEASIFL factor 100% 2 Y
ADAETTE BEP01FL factor 100% 2 Y
ADAETTE AEWITHFL factor 100% 2 N
ADAETTE RANDDT Date 100% 296 2020-03-08
ADAETTE TRTSDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADAETTE TRTEDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADAETTE TRT01SDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADAETTE TRT01EDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADAETTE TRT02SDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADAETTE TRT02EDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADAETTE AP01SDTM POSIXct/POSIXt 100% 400 2020-03-08 05:39:28.683
ADAETTE AP01EDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADAETTE AP02SDTM POSIXct/POSIXt 82% 328 2021-02-14 14:58:26.683
ADAETTE AP02EDTM POSIXct/POSIXt 82% 328 2022-02-14 20:58:26.683
ADAETTE EOSSTT factor 100% 3 DISCONTINUED
ADAETTE EOTSTT factor 100% 3 DISCONTINUED
ADAETTE EOSDT Date 82% 179 2022-02-14
ADAETTE EOSDY integer 82% 114 709
ADAETTE DCSREAS factor 30% 8 PROTOCOL VIOLATION
ADAETTE DTHDT Date 18% 44 2022-03-16
ADAETTE DTHCAUS factor 18% 8 ADVERSE EVENT
ADAETTE DTHCAT factor 18% 4 ADVERSE EVENT
ADAETTE LDDTHELD integer 18% 36 24
ADAETTE LDDTHGR1 factor 18% 3 <=30
ADAETTE LSTALVDT Date 82% 219 2022-03-09
ADAETTE DTHADY integer 18% 67 496
ADAETTE ADTHAUT factor 14% 3 Yes
ADAETTE ASEQ integer 100% 9 6
ADAETTE TTESEQ integer 100% 9 6
ADAETTE PARAM factor 100% 9 Time to end of AE reporting period
ADAETTE PARAMCD factor 100% 9 AEREPTTE
ADAETTE AVAL numeric 100% 1460 1.94113620807666
ADAETTE AVALU factor 67% 3 YEARS
ADAETTE ADTM POSIXct/POSIXt 67% 2155 2022-02-14
ADAETTE ADY integer 67% 635 709
ADAETTE CNSR integer 67% 3 0
ADAETTE EVNTDESC character 47% 6 Completion or Discontinuation
ADAETTE CNSDTDSC character 53% 7

The inventory helps us decide which variables to use for each analysis:

  • Categorical variables with few distinct values (e.g. ARM, SEX, RACE, PARAMCD) → grouping / stratification factors.
  • Continuous numeric columns (e.g. AVAL, CHG, BASE, AGE, BMRKR1) → outcome or covariate in regressions, boxplots, trajectories.
  • Date / time columns (e.g. TRTSDTM, TRTEDTM, ADTM) → time-on-treatment, duration calculations.
  • Censoring indicator and event description (CNSR, EVNTDESC) → time-to-event (survival) modelling.
  • Visit identifiers (AVISIT, AVISITN) → longitudinal panel structure, repeated-measures models.
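
The rules of thumb above can be sketched as a small classifier that looks only at a column's class and its number of distinct values. The function name suggest_role and the 10-level threshold are assumptions made for illustration, and the toy data frame merely stands in for an ADaM dataset.

```r
# Hypothetical helper: assign a coarse analytical role to a column using
# only its class and number of distinct non-missing values.
suggest_role <- function(x, max_levels = 10) {
  n_unique <- length(unique(x[!is.na(x)]))
  if (inherits(x, c("Date", "POSIXct"))) {
    "date/time"
  } else if (is.factor(x) || is.character(x)) {
    if (n_unique <= max_levels) "grouping factor" else "identifier/free text"
  } else if (is.numeric(x)) {
    if (n_unique <= max_levels) "discrete code" else "continuous"
  } else {
    "other"
  }
}

# Toy data frame standing in for an ADaM dataset (values are made up)
toy <- data.frame(
  USUBJID = sprintf("id-%02d", 1:20),
  ARM     = factor(rep(c("A", "B"), 10)),
  AVAL    = seq(60, 98, by = 2),
  CNSR    = rep(c(0L, 1L), 10),
  ADTM    = as.POSIXct("2020-01-01", tz = "UTC") + 86400 * (1:20),
  stringsAsFactors = FALSE
)

roles <- vapply(toy, suggest_role, character(1))
print(roles)
```

A heuristic like this gives only a first guess: a numeric column with few distinct values (such as CNSR) still needs a human decision on whether it is a flag, a grade, or a visit number.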

The cached data ships with 400 subjects. To match the 200-patient cohort described in the introduction, we draw a random subsample of 200 subjects and restrict every dataset to them. In a production system this step would not exist: the warehouse query would simply return the real cohort.

selected_ids <- adsl %>%
  distinct(USUBJID) %>%
  slice_sample(n = 200) %>%
  pull(USUBJID)

adsl    <- adsl    %>% filter(USUBJID %in% selected_ids)
advs    <- advs    %>% filter(USUBJID %in% selected_ids)
adlb    <- adlb    %>% filter(USUBJID %in% selected_ids)
adaette <- adaette %>% filter(USUBJID %in% selected_ids)

After subsampling we have 200 patients across 63 sites and 3 treatment arms (A: Drug X, B: Placebo, C: Combination).


2. Demographic overview by treatment arm

The first step in any clinical analysis is understanding who is in the study. In a multi-centre trial we need to verify that treatment arms are balanced with respect to key covariates, because imbalances in age or sex could confound every downstream comparison. In DM1 specifically, age at onset correlates inversely with CTG repeat length, so a skewed age distribution in one arm is a red flag.

We show two complementary views:

  • Left panel: patient counts by treatment arm and sex (are groups balanced?).
  • Right panel: overlaid age density curves by arm (are the age distributions similar?).
# Patient counts by arm and sex
demo_summary <- adsl %>%
  count(ARM, SEX, name = "n_patients") %>%
  mutate(ARM = fct_reorder(ARM, n_patients, .fun = sum))

p1_left <- ggplot(demo_summary, aes(x = ARM, y = n_patients, fill = SEX)) +
  geom_col(position = "dodge", width = 0.7, colour = "grey30", linewidth = 0.3) +
  scale_fill_manual(
    values = c("F" = "#E07B91", "M" = "#6BAED6"),
    labels = c("F" = "Female", "M" = "Male")
  ) +
  labs(x = NULL, y = "Number of patients", fill = "Sex") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1),
        legend.position = "bottom")

# Age density by arm
p1_right <- ggplot(adsl, aes(x = AGE, fill = ARM, colour = ARM)) +
  geom_density(alpha = 0.25, linewidth = 0.6) +
  labs(x = "Age (years)", y = "Density",
       fill = "Treatment arm", colour = "Treatment arm") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")

if (is_html) {
  subplot(
    ggplotly(p1_left,  tooltip = c("x", "y", "fill")),
    ggplotly(p1_right, tooltip = c("x", "y", "fill")),
    nrows = 1, shareY = FALSE, titleX = TRUE, titleY = TRUE, margin = 0.06
  ) %>%
    layout(
      title = list(text = "Demographic Overview by Treatment Arm", x = 0.5),
      legend = list(orientation = "h", y = -0.15, x = 0.5, xanchor = "center"),
      margin = list(t = 80, b = 80)
    ) %>%
    add_plotly_config()
} else {
  grid.arrange(p1_left, p1_right, ncol = 2)
}

Fig. 1. Demographic overview by treatment arm. Left: patient counts by arm and sex. Right: overlaid age density curves by treatment arm.

The three arms are roughly balanced in size and sex ratio. Age distributions overlap substantially, suggesting randomisation achieved its goal. A formal test (e.g. ANOVA on age, chi-square on sex) could confirm this, but for an exploratory pass the visual check is sufficient.
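
The formal checks mentioned above could look like the following sketch. It runs on a simulated stand-in for ADSL (not the report's data), so the p-values it prints say nothing about the actual cohort. The column names mirror ADSL, while the simulated distributions are assumptions.

```r
set.seed(1)

# Toy stand-in for ADSL: 200 patients, 3 arms (simulated, not real data)
toy_adsl <- data.frame(
  ARM = factor(sample(c("A: Drug X", "B: Placebo", "C: Combination"),
                      200, replace = TRUE)),
  SEX = factor(sample(c("F", "M"), 200, replace = TRUE)),
  AGE = round(rnorm(200, mean = 35, sd = 8))
)

# One-way ANOVA: does mean age differ between arms?
age_fit <- aov(AGE ~ ARM, data = toy_adsl)
age_p   <- summary(age_fit)[[1]][["Pr(>F)"]][1]

# Chi-square test of independence: is sex associated with arm?
sex_tab <- table(toy_adsl$ARM, toy_adsl$SEX)
sex_p   <- chisq.test(sex_tab)$p.value

cat(sprintf("ANOVA (age ~ arm) p = %.3f\n", age_p))
cat(sprintf("Chi-square (sex x arm) p = %.3f\n", sex_p))
```

A large p-value here does not prove balance. It only indicates that the data give no evidence of imbalance, so these tests complement the visual check rather than replace it.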


3. Vital signs trajectories over time

DM1 is a multi-systemic disease with well-documented cardiovascular involvement—cardiac conduction defects, arrhythmias, and in some cohorts altered blood pressure regulation. Monitoring systolic blood pressure (SBP) across scheduled study visits is therefore clinically meaningful.

We use a spaghetti-plus-mean design:

  • Thin grey lines: individual patient trajectories, revealing the full range of within-patient variability.
  • Coloured ribbon + line: group mean ± 95% confidence interval by arm, showing the central trend.

This layered approach lets the reader simultaneously assess heterogeneity and treatment-level patterns without suppressing the underlying data.

sbp <- advs %>%
  filter(
    PARAMCD == "SYSBP",
    AVISIT  != "",
    !is.na(AVAL),
    !is.na(AVISITN)
  ) %>%
  select(USUBJID, ARM, AVISIT, AVISITN, AVAL)

# Summary statistics per arm per visit
sbp_summary <- sbp %>%
  group_by(ARM, AVISIT, AVISITN) %>%
  summarise(
    mean_val = mean(AVAL, na.rm = TRUE),
    se       = sd(AVAL, na.rm = TRUE) / sqrt(n()),
    n        = n(),
    .groups  = "drop"
  ) %>%
  mutate(
    lo = mean_val - 1.96 * se,
    hi = mean_val + 1.96 * se
  )

p2_static <- ggplot() +
  geom_line(
    data = sbp,
    aes(x = AVISITN, y = AVAL, group = USUBJID),
    alpha = 0.08, linewidth = 0.3, colour = "grey50"
  ) +
  geom_ribbon(
    data = sbp_summary,
    aes(x = AVISITN, ymin = lo, ymax = hi, fill = ARM),
    alpha = 0.25
  ) +
  geom_line(
    data = sbp_summary,
    aes(x = AVISITN, y = mean_val, colour = ARM),
    linewidth = 0.9
  ) +
  geom_point(
    data = sbp_summary,
    aes(x = AVISITN, y = mean_val, colour = ARM),
    size = 1.8
  ) +
  facet_wrap(~ ARM, nrow = 1) +
  scale_x_continuous(breaks = sort(unique(sbp_summary$AVISITN))) +
  labs(
    x = "Analysis visit number",
    y = "Systolic BP (Pa)",
    colour = "Arm", fill = "Arm"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold"))

if (is_html) {
  arm_colours <- setNames(
    scales::hue_pal()(n_distinct(sbp_summary$ARM)),
    levels(sbp_summary$ARM)
  )

  p2_ly <- plot_ly()
  for (arm in unique(sbp_summary$ARM)) {
    arm_sbp  <- sbp %>% filter(ARM == arm)
    arm_summ <- sbp_summary %>% filter(ARM == arm) %>% arrange(AVISITN)

    # Individual traces (light, no legend entry)
    for (uid in unique(arm_sbp$USUBJID)) {
      d <- arm_sbp %>% filter(USUBJID == uid)
      p2_ly <- p2_ly %>% add_lines(
        data = d, x = ~AVISITN, y = ~AVAL,
        line = list(color = "rgba(160,160,160,0.12)", width = 0.8),
        hoverinfo = "text",
        text = ~paste0("Patient: ", USUBJID, "<br>Visit: ", AVISIT,
                       "<br>SBP: ", round(AVAL, 1)),
        showlegend = FALSE, legendgroup = arm
      )
    }

    # 95% CI ribbon
    p2_ly <- p2_ly %>%
      add_ribbons(
        data = arm_summ, x = ~AVISITN, ymin = ~lo, ymax = ~hi,
        fillcolor = paste0(arm_colours[arm], "33"),
        line = list(color = "transparent"),
        showlegend = FALSE, legendgroup = arm
      )

    # Mean line + points
    p2_ly <- p2_ly %>%
      add_trace(
        data = arm_summ, x = ~AVISITN, y = ~mean_val,
        type = "scatter", mode = "lines+markers",
        line = list(color = arm_colours[arm], width = 2.5),
        marker = list(color = arm_colours[arm], size = 7),
        name = arm, legendgroup = arm,
        hoverinfo = "text",
        text = ~paste0(arm, "<br>Visit: ", AVISIT,
                       "<br>Mean SBP: ", round(mean_val, 1),
                       "<br>95% CI: [", round(lo, 1), ", ", round(hi, 1), "]",
                       "<br>n = ", n)
      )
  }

  p2_ly %>% layout(
    title  = list(text = "Systolic Blood Pressure Trajectories Over Visits"),
    xaxis  = list(title = "Analysis visit number"),
    yaxis  = list(title = "Systolic BP (Pa)"),
    legend = list(orientation = "h", y = -0.15),
    hovermode = "closest"
  ) %>%
  add_plotly_config()
} else {
  p2_static
}

Fig. 2. Systolic blood pressure trajectories over scheduled study visits. Individual patient traces (grey) are overlaid with group mean ± 95% CI by treatment arm.

Mean SBP stays relatively stable across visits in all three arms, with wide individual variability (a realistic pattern for a heterogeneous disease). In a real DM1 dataset we would also overlay cardiac conduction metrics (PR interval, QTc) from the ECG domain—cross-modal exploration that the proposed system is designed to support.
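
The ribbon in Fig. 2 is computed as mean ± 1.96 × SE, which is a normal approximation. When an arm has few patients at a visit, a t-based interval is somewhat wider. The sketch below compares the two on made-up values; the helper name mean_ci is hypothetical.

```r
# Hypothetical helper: mean and confidence interval using either the
# normal quantile (as in the report) or the t quantile with n - 1 df.
mean_ci <- function(x, level = 0.95, method = c("normal", "t")) {
  method <- match.arg(method)
  x  <- x[!is.na(x)]
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  p  <- 1 - (1 - level) / 2
  q  <- if (method == "normal") qnorm(p) else qt(p, df = n - 1)
  c(mean = mean(x), lo = mean(x) - q * se, hi = mean(x) + q * se)
}

# Made-up measurements for one arm at one visit
sbp_toy <- c(118, 125, 131, 122, 140, 128, 119, 135)

ci_norm <- mean_ci(sbp_toy, method = "normal")
ci_t    <- mean_ci(sbp_toy, method = "t")
print(rbind(normal = ci_norm, t = ci_t))
```

With the group sizes in this report (roughly 60 to 70 patients per arm), the two intervals would be nearly identical. The distinction matters mainly for small subgroups.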


4. Laboratory biomarker change from baseline

In DM1, hepatic and inflammatory markers deserve close surveillance:

  • ALT (Alanine Aminotransferase): a standard liver-function marker. Many candidate therapies for muscular dystrophies carry hepatotoxicity risk.
  • CRP (C-Reactive Protein): a systemic inflammation marker increasingly studied in muscular dystrophies, where chronic low-grade inflammation accompanies muscle degeneration.

Showing change from baseline (CHG) rather than raw values lets us focus on within-patient shifts—the natural framing for a longitudinal study and for the materialised views we proposed in the analysis system (the patient-level summary view stores exactly this kind of derived quantity).
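
For reference, the ADaM convention is CHG = AVAL − BASE and PCHG = 100 × CHG / BASE, where BASE is the value on the record flagged as baseline. The cached datasets already contain these columns. The sketch below only illustrates the derivation on made-up values, assuming the baseline record is the one with ABLFL == "Y".

```r
# Toy long-format lab data for two patients (values are made up)
toy_lb <- data.frame(
  USUBJID = rep(c("id-1", "id-2"), each = 3),
  AVISIT  = rep(c("BASELINE", "WEEK 1 DAY 8", "WEEK 2 DAY 15"), 2),
  ABLFL   = rep(c("Y", "", ""), 2),
  AVAL    = c(20, 25, 18, 40, 44, 50)
)

# Look up each patient's baseline value and attach it to every record
base_vals   <- toy_lb[toy_lb$ABLFL == "Y", c("USUBJID", "AVAL")]
toy_lb$BASE <- base_vals$AVAL[match(toy_lb$USUBJID, base_vals$USUBJID)]

# Absolute and percentage change from baseline
toy_lb$CHG  <- toy_lb$AVAL - toy_lb$BASE
toy_lb$PCHG <- 100 * toy_lb$CHG / toy_lb$BASE

print(toy_lb)
```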

lab_chg <- adlb %>%
  filter(
    PARAMCD %in% c("ALT", "CRP"),
    AVISIT  != "",
    !is.na(CHG),
    !is.na(AVISITN)
  ) %>%
  select(USUBJID, ARM, PARAMCD, PARAM, AVISIT, AVISITN, CHG) %>%
  mutate(AVISIT = fct_reorder(AVISIT, AVISITN))

p3_static <- ggplot(lab_chg, aes(x = AVISIT, y = CHG, fill = ARM)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
  geom_boxplot(
    outlier.size = 0.7, outlier.alpha = 0.4,
    linewidth = 0.35, width = 0.7,
    position = position_dodge(width = 0.8)
  ) +
  facet_wrap(~ PARAM, scales = "free_y", ncol = 1) +
  labs(
    x    = "Scheduled visit",
    y    = "Change from baseline",
    fill = "Treatment arm"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1, size = 8),
    legend.position = "top",
    strip.text = element_text(face = "bold", size = 11)
  )

if (is_html) {
  params <- unique(lab_chg$PARAMCD)
  p3_panels <- lapply(params, function(pc) {
    d <- lab_chg %>% filter(PARAMCD == pc)
    plot_ly(
      data = d, x = ~AVISIT, y = ~CHG, color = ~ARM,
      type = "box",
      hoverinfo = "y+name",
      showlegend = (pc == params[1])
    ) %>%
      layout(
        boxmode = "group",
        annotations = list(
          # `face` is not a plotly font attribute; use <b> tags for bold text
          text = paste0("<b>", unique(d$PARAM), "</b>"),
          xref = "paper", yref = "paper",
          x = 0.5, y = 1.06, showarrow = FALSE,
          font = list(size = 14)
        ),
        shapes = list(list(
          type = "line", x0 = 0, x1 = 1, xref = "paper",
          y0 = 0, y1 = 0, line = list(color = "grey", dash = "dash")
        ))
      )
  })

  subplot(p3_panels, nrows = length(params), shareX = TRUE, titleY = TRUE) %>%
    layout(
      title  = list(text = "Change from Baseline in ALT and CRP Over Visits"),
      yaxis  = list(title = "Change from baseline"),
      yaxis2 = list(title = "Change from baseline"),
      boxmode = "group",
      legend  = list(orientation = "h", y = -0.1),
      margin  = list(t = 70)
    ) %>%
    add_plotly_config()
} else {
  p3_static
}

Fig. 3. Change from baseline in ALT and CRP across study visits by treatment arm. The dashed line marks zero change; boxes show the inter-quartile range with whiskers extending to 1.5 × IQR.

Both ALT and CRP change distributions remain centred around zero with no obvious arm-level divergence, which is reassuring from a safety standpoint. The inter-quartile ranges widen slightly at later visits, consistent with increasing variability as follow-up time grows and some patients are lost. In a real DM1 study, we would flag individual patients whose ALT exceeds 3 × ULN (the transaminase criterion of Hy’s Law, which additionally requires elevated bilirubin) and join their data with the genomic modality to look for variant-level associations. This is exactly the cross-modal workflow the system is designed to enable.
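
The ALT flag described above can be sketched as follows. The example assumes ANRHI (the upper limit of the analysis normal range) is used as the ULN, and it runs on made-up records rather than the report's data.

```r
# Toy ALT records (made up); ANRHI is assumed to hold the ULN
toy_alt <- data.frame(
  USUBJID = c("id-1", "id-1", "id-2", "id-2", "id-3"),
  PARAMCD = "ALT",
  AVAL    = c(30, 180, 50, 60, 170),
  ANRHI   = 55
)

uln_multiple <- 3

# Flag every record whose value exceeds 3 x ULN
toy_alt$ALT_GT3ULN <- toy_alt$AVAL > uln_multiple * toy_alt$ANRHI

# Patients with at least one flagged record
flagged_ids <- unique(toy_alt$USUBJID[toy_alt$ALT_GT3ULN])
print(flagged_ids)
```

The resulting vector of subject identifiers is the natural join key for pulling the same patients from another modality.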


5. Kaplan-Meier: time to first adverse event

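Time-to-event analysis is one of the most information-rich ways to compare safety profiles across treatment arms, because it uses both whether and when each patient had an event and handles censored follow-up. The Kaplan-Meier estimator gives non-parametric estimates of the event-free survival curve; the log-rank test provides a formal between-arm comparison.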
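
To make the estimator concrete: at each distinct event time t_i, with n_i patients still at risk and d_i events, the Kaplan-Meier estimate is the running product of (1 − d_i / n_i). The sketch below computes this by hand on made-up data and checks it against survival::survfit().

```r
library(survival)

# Toy data (made up): time in weeks; event = 1 means AE observed, 0 = censored
time  <- c(2, 3, 3, 5, 8, 8, 9, 12)
event <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Hand computation of the product-limit estimate at each distinct event time
event_times <- sort(unique(time[event == 1]))
n_at_risk   <- sapply(event_times, function(tt) sum(time >= tt))
n_events    <- sapply(event_times, function(tt) sum(time == tt & event == 1))
km_by_hand  <- cumprod(1 - n_events / n_at_risk)

# Same estimate from survival::survfit, evaluated at the event times
fit    <- survfit(Surv(time, event) ~ 1)
km_pkg <- summary(fit, times = event_times)$surv

print(data.frame(event_times, n_at_risk, n_events, km_by_hand, km_pkg))
```

Censored patients never trigger a drop in the curve, but they leave the risk set, which is why later steps are larger than earlier ones.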

In the proposed analysis system, this would be exposed as a parameterised module: the researcher selects the event of interest (any AE, serious AE, grade 3–5 AE), stratification variables, and optional covariates via a configuration file, and the pipeline produces the curves and test results automatically.
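
A minimal sketch of such a parameterised module is shown below. The function name run_tte_module and the config fields paramcd and strata are assumptions made for illustration, and the data are simulated rather than taken from the report.

```r
library(survival)

# Hypothetical module: config fields and function name are illustrative only.
run_tte_module <- function(data, config) {
  d <- data[data$PARAMCD == config$paramcd & !is.na(data$CNSR), ]
  d$event <- 1 - d$CNSR                      # CNSR: 1 = censored, 0 = event
  form <- as.formula(
    paste("Surv(AVAL, event) ~", paste(config$strata, collapse = " + "))
  )
  fit  <- survfit(form, data = d)            # Kaplan-Meier curves per stratum
  test <- survdiff(form, data = d)           # log-rank test
  p    <- pchisq(test$chisq, df = length(test$n) - 1, lower.tail = FALSE)
  list(fit = fit, logrank_p = p, n = nrow(d), n_events = sum(d$event))
}

# Simulated stand-in for ADAETTE (not real data)
set.seed(7)
toy_tte <- data.frame(
  USUBJID = sprintf("id-%03d", 1:90),
  ARM     = factor(rep(c("A", "B", "C"), each = 30)),
  PARAMCD = "AETTE1",
  AVAL    = rexp(90, rate = rep(c(0.5, 0.7, 0.9), each = 30)),
  CNSR    = rbinom(90, 1, 0.3)
)

config <- list(paramcd = "AETTE1", strata = "ARM")
res    <- run_tte_module(toy_tte, config)

print(res$fit)
cat(sprintf("Log-rank p = %.4f\n", res$logrank_p))
```

Switching to serious adverse events would then mean changing only paramcd to "AETTE2" in the configuration, with no change to the analysis code.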

# Available time-to-event endpoints
tte_params <- adaette %>% distinct(PARAMCD, PARAM)
knitr::kable(tte_params, col.names = c("Code", "Description"))
Code Description
AEREPTTE Time to end of AE reporting period
AETOT1 Number of occurrences of any adverse event
AETOT2 Number of occurrences of any serious adverse event
AETOT3 Number of occurrences of a grade 3-5 adverse event
AETTE1 Time to first occurrence of any adverse event
AETTE2 Time to first occurrence of any serious adverse event
AETTE3 Time to first occurrence of a grade 3-5 adverse event
HYSTTEBL Time to Hy’s Law Elevation in relation to Baseline
HYSTTEUL Time to Hy’s Law Elevation in relation to ULN

We select AETTE1 (time to first occurrence of any adverse event)—the broadest safety endpoint.

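tte_data <- adaette %>%
  filter(
    PARAMCD == "AETTE1",
    !is.na(AVAL),
    !is.na(CNSR)
  ) %>%
  select(USUBJID, ARM, AVAL, CNSR, PARAM) %>%
  distinct(USUBJID, .keep_all = TRUE) %>%
  mutate(
    event    = 1 - CNSR,  # CNSR: 1 = censored, 0 = event observed
    # The variable inventory reports AVALU = "YEARS" for these records,
    # so convert years to weeks instead of dividing by 7.
    time_wks = AVAL * 365.25 / 7
  )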

event_label <- tte_data %>% pull(PARAM) %>% unique() %>% first()

# Kaplan-Meier fit
surv_obj <- Surv(time = tte_data$time_wks, event = tte_data$event)
km_fit   <- survfit(surv_obj ~ ARM, data = tte_data)

# Log-rank test
lr_test <- survdiff(surv_obj ~ ARM, data = tte_data)
lr_pval <- pchisq(lr_test$chisq, df = length(lr_test$n) - 1, lower.tail = FALSE)

Events observed: 125 / 200 patients experienced the event. Log-rank p-value: 0.0446.

km_tidy <- tidy(km_fit) %>%
  mutate(ARM = str_remove(strata, "^ARM="))

# Number-at-risk at evenly spaced time points
risk_times   <- seq(0, max(km_tidy$time, na.rm = TRUE), length.out = 6) %>% round(1)
risk_summary <- summary(km_fit, times = risk_times)
risk_tbl     <- tibble(
  time   = risk_summary$time,
  n.risk = risk_summary$n.risk,
  ARM    = str_remove(risk_summary$strata, "^ARM=")
)

if (is_html) {
  arm_colours <- setNames(
    scales::hue_pal()(n_distinct(km_tidy$ARM)),
    unique(km_tidy$ARM)
  )

  p4_ly <- plot_ly()
  for (arm in unique(km_tidy$ARM)) {
    d <- km_tidy %>% filter(ARM == arm) %>% arrange(time)

    # CI ribbon
    p4_ly <- p4_ly %>%
      add_ribbons(
        data = d, x = ~time, ymin = ~conf.low, ymax = ~conf.high,
        fillcolor = paste0(arm_colours[arm], "22"),
        line = list(color = "transparent"),
        showlegend = FALSE, legendgroup = arm,
        hoverinfo = "skip"
      )

    # Step curve
    p4_ly <- p4_ly %>%
      add_trace(
        data = d, x = ~time, y = ~estimate,
        type = "scatter", mode = "lines",
        line = list(color = arm_colours[[arm]], width = 2.2, shape = "hv"),
        name = arm, legendgroup = arm,
        hoverinfo = "text",
        text = ~paste0(
          arm,
          "<br>Time: ", round(time, 1), " weeks",
          "<br>Event-free: ", scales::percent(estimate, accuracy = 0.1),
          "<br>95% CI: [", scales::percent(conf.low, accuracy = 0.1),
          ", ", scales::percent(conf.high, accuracy = 0.1), "]",
          "<br>n.risk: ", n.risk,
          "<br>n.event: ", n.event
        )
      )

    # Censor tick marks
    censored <- d %>% filter(n.censor > 0)
    if (nrow(censored) > 0) {
      p4_ly <- p4_ly %>%
        add_markers(
          data = censored, x = ~time, y = ~estimate,
          marker = list(symbol = "line-ns", size = 8,
                        color = arm_colours[[arm]],
                        line = list(width = 1.5, color = arm_colours[[arm]])),
          showlegend = FALSE, legendgroup = arm,
          hoverinfo = "text",
          text = ~paste0("Censored (n=", n.censor, ")")
        )
    }
  }

  p4_ly %>% layout(
    title   = list(text = paste0(
      "Kaplan-Meier: Time to First Adverse Event",
      "<br><sup>Log-rank p = ", sprintf("%.4f", lr_pval), "</sup>"
    )),
    xaxis   = list(title = "Time (weeks)"),
    yaxis   = list(title = "Event-free probability",
                   tickformat = ".0%", range = c(0, 1)),
    legend  = list(orientation = "h", y = -0.15),
    hovermode = "closest",
    margin  = list(t = 80)
  ) %>%
  add_plotly_config()
} else {
  # Static fallback for github_document
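  # lead() must run within each arm; otherwise the last interval of one arm
  # borrows the first time point of the next
  km_band <- km_tidy %>%
    group_by(ARM) %>%
    mutate(time_next = lead(time, default = max(time))) %>%
    ungroup()

  p_km <- ggplot(km_tidy, aes(x = time, y = estimate, colour = ARM, fill = ARM)) +
    geom_step(linewidth = 0.8) +
    geom_rect(
      data = km_band,
      aes(xmin = time, xmax = time_next, ymin = conf.low, ymax = conf.high),
      alpha = 0.10, colour = NA
    ) +
    # Censoring tick marks, as in the interactive version
    geom_point(data = filter(km_tidy, n.censor > 0),
               shape = 3, size = 1.5, show.legend = FALSE) +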
    annotate(
      "text",
      x = max(km_tidy$time) * 0.65, y = 0.15,
      label = sprintf("Log-rank p = %.4f", lr_pval),
      size = 3.5, fontface = "italic", colour = "grey30"
    ) +
    scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
    labs(
      x      = "Time (weeks)",
      y      = "Event-free probability",
      colour = "Treatment arm",
      fill   = "Treatment arm"
    ) +
    theme_minimal(base_size = 11) +
    theme(legend.position = "top")

  p_risk <- ggplot(risk_tbl, aes(x = time, y = ARM, label = n.risk)) +
    geom_text(size = 3) +
    labs(x = "Time (weeks)", y = NULL, title = "Number at risk") +
    theme_minimal(base_size = 9) +
    theme(
      panel.grid = element_blank(),
      plot.title = element_text(size = 9, face = "bold"),
      axis.text.y = element_text(face = "bold")
    )

  grid.arrange(p_km, p_risk, ncol = 1, heights = c(4, 1))
}

Fig. 4. Kaplan–Meier curves for time to first adverse event by treatment arm, with 95% confidence bands and censoring tick marks. The log-rank test p-value is annotated.

The Kaplan-Meier curves separate modestly between arms, with a log-rank p-value of 0.0446, which is nominally significant at the 0.05 level but unadjusted for any covariate. In a real DM1 study, such a result would warrant follow-up with a Cox proportional-hazards model adjusted for baseline covariates (age, sex, disease severity, CTG repeat length). It would also be cross-referenced with the genomic and proteomic modalities to identify biological correlates of adverse event risk.
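As a rough illustration of that follow-up step, the sketch below fits a Cox proportional-hazards model on simulated data. The `toy` data frame and its values are assumptions made only so the example runs on its own; in the report the model would use `tte_data` joined to baseline covariates. Disease severity and CTG repeat length are omitted because the synthetic CDISC data does not carry them.

```r
library(survival)

# Simulated stand-in for tte_data plus baseline covariates (not real data)
set.seed(42)
n <- 200
toy <- data.frame(
  ARM      = factor(rep(c("A: Drug X", "B: Placebo", "C: Combination"),
                        length.out = n)),
  AGE      = round(rnorm(n, mean = 45, sd = 12)),
  SEX      = factor(sample(c("F", "M"), n, replace = TRUE)),
  time_wks = rexp(n, rate = 0.05),
  event    = rbinom(n, 1, 0.6)
)

# Treatment effect adjusted for age and sex
cox_fit <- coxph(Surv(time_wks, event) ~ ARM + AGE + SEX, data = toy)

# Hazard ratios with 95% confidence intervals
hr_tbl <- cbind(HR = exp(coef(cox_fit)), exp(confint(cox_fit)))
print(round(hr_tbl, 2))

# Schoenfeld-residual test of the proportional-hazards assumption
ph_check <- cox.zph(cox_fit)
print(ph_check)
```

The `cox.zph()` check matters because the hazard ratios are only interpretable as constant effects if the proportional-hazards assumption is plausible for each term.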


Summary

The four analyses above trace a deliberate progression:

| # | Analysis                 | Type                        | System capability it demonstrates              |
|:--|:-------------------------|:----------------------------|:-----------------------------------------------|
| 1 | Demographic overview     | Descriptive                 | Fast queries over the ADSL materialised view   |
| 2 | SBP trajectories         | Longitudinal / time         | Cross-visit exploration of vital signs         |
| 3 | ALT & CRP change from BL | Longitudinal / safety       | Change-from-baseline monitoring for dashboards |
| 4 | KM time to first AE      | Time-to-event / inferential | Parameterised survival module for notebooks    |

In the proposed dynamic analysis system, each of these would be:

  • Runnable from a notebook (JupyterLab / RStudio) connected to the OMOP CDM warehouse via read-only, role-scoped credentials.
  • Surfaced in dashboards (Grafana for operational metrics, Apache Superset for research exploration).
  • Schedulable as automated reports via Airflow or Prefect DAGs, with output rendered to PDF/HTML and distributed to defined recipients.
  • Version-controlled and reproducible, with every result tied to a code version, data-cut date, and parameter set.
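
One way to tie a result to its code version, data-cut date, and parameter set is to store a small provenance record alongside each output. The sketch below uses base R only. The helper name `build_provenance`, its field names, and the example commit id and data-cut date are hypothetical and shown purely for illustration.

```r
# Hypothetical helper: field names and record layout are assumptions.
build_provenance <- function(params, data_cut, code_version = NA_character_) {
  if (is.na(code_version)) {
    # Try to read the current git commit; fall back to "unknown" outside a repo
    code_version <- tryCatch(
      system2("git", c("rev-parse", "--short", "HEAD"),
              stdout = TRUE, stderr = FALSE)[1],
      error   = function(e) NA_character_,
      warning = function(w) NA_character_
    )
    if (is.na(code_version) || !nzchar(code_version)) code_version <- "unknown"
  }

  list(
    code_version = code_version,
    data_cut     = format(as.Date(data_cut), "%Y-%m-%d"),
    params       = params,
    rendered_at  = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  )
}

prov <- build_provenance(
  params       = list(paramcd = "AETTE1", strata = "ARM"),
  data_cut     = "2024-01-31",  # hypothetical data-cut date
  code_version = "abc1234"      # hypothetical commit id
)
str(prov)
```

A scheduled job could save this record next to the rendered PDF/HTML, so that any figure can be traced back to the inputs that produced it.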